This dataset has 17 columns and 1000 rows. It covers 3 months of data, from January through March, and includes 3 branches (A, B, C) from 3 different cities: Yangon, Naypyitaw, and Mandalay.
This project analyzes customer behavior in a supermarket through different models, using data mining techniques.

There will be 5 stages involved in this project:
1. Preprocessing
2. Data Visualization
3. Data Analysis
4. Model Creation
5. Discussion and Conclusion
#Dataset Source: https://www.kaggle.com/aungpyaeap/supermarket-sales
#Import Libraries
import itertools
import warnings
from datetime import datetime, timedelta
from itertools import product
from typing import Dict, List

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.tsa.api as smt
from matplotlib.colors import ListedColormap
from pandas import Timestamp
from pandas.core.frame import DataFrame
from pandas.core.series import Series
from pandas.plotting import autocorrelation_plot
from pylab import rcParams
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (r2_score, median_absolute_error, mean_absolute_error,
                             mean_squared_error, mean_squared_log_error, silhouette_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.arima_model import ARIMA, ARMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing, HoltWintersResults
from statsmodels.tsa.statespace.sarimax import SARIMAX
from tqdm import tqdm_notebook
%matplotlib inline
np.random.seed(10000) # Initial seeding
#Set plot style and defaults
plt.rcParams['figure.figsize'] = (25,10)
plt.rcParams['font.size'] = 15
plt.style.use("ggplot")
sns.set()
rcParams['figure.figsize'] = 12, 7
#To ignore some unimportant warnings
warnings.filterwarnings('ignore')
#importing dataset
sales = pd.read_csv(r"C:\Users\tahmi\Documents\GitHub\DataMiningGroupProject\supermarket_sales.csv")
#checking headings
sales.head()
#checking data info
sales.info()
By inspection, the 'Date' column's datatype is object; we need to convert it to datetime.
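As an aside, this conversion can also be done in one vectorized call, assuming the dataset's MM/DD/YYYY and HH:MM formats (the mini-frame below is a hypothetical stand-in, not the actual data):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the dataset's Date/Time columns
df = pd.DataFrame({"Date": ["1/5/2019", "3/15/2019"], "Time": ["13:08", "10:29"]})

# Concatenate the two string columns and parse them in one shot
df["Date"] = pd.to_datetime(df["Date"] + " " + df["Time"], format="%m/%d/%Y %H:%M")
print(df["Date"].dt.month.tolist())  # → [1, 3]
```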
#process of converting the Date column into ISO format
def to_isoformat(row: Series) -> datetime:
    """
    Note: original date format is MM/DD/YYYY; we concatenate ['Date'] + ['Time'].
    This function converts the given row into an ISO-format datetime.
    :row: a row vector that contains Date and Time
    :return: a datetime built from the row
    """
    date: List[str] = row[0].split("/")
    time: str = row[1]
    date[0] = date[0] if int(date[0]) >= 10 else f"0{date[0]}"
    date[1] = date[1] if int(date[1]) >= 10 else f"0{date[1]}"
    seconds: int = int(np.random.randint(10, 60, size=1))  # random seconds, since the data only has HH:MM
    return datetime.fromisoformat(f"{date[-1]}-{date[0]}-{date[1]}T{time}:{seconds}")
def get_col_desc(col_one: List, col_two: List[str]) -> Dict:
    """
    Helper function that returns a description map of the given 2 columns.
    :col_one: a list of column data
    :col_two: a list of column names
    :return: a dictionary that maps column names to column attributes
    """
    if len(col_one) != len(col_two):
        raise ValueError("Both columns must have the same length.")
    return dict(zip(col_two, col_one))
#sales['date'] = pd.to_datetime(sales['Date'])
sales['Date'] = pd.to_datetime( sales[['Date','Time']].apply(to_isoformat,axis=1) ) # Concat Date and Time together to see the sale trend
sales.set_index('Date', inplace=True)
sales.drop('Time',inplace=True, axis=1)
#checking the column is not object
sales.dtypes
#verify result
sales.head()
#data description
sales.describe()
#check for missing values
sales.isnull().sum()
#Count function for cities
sns.set(style="darkgrid") #style the plot background to become a grid
cityCount = sns.countplot(x="City", data =sales).set_title("City_counts")
#Gender Count
plt.figure(figsize=(14,6))
plt.style.use('fivethirtyeight')
ax = sns.countplot(x='Gender', data=sales, palette='copper')
ax.set_xlabel(xlabel= "Gender",fontsize=18)
ax.set_ylabel(ylabel = "Gender count", fontsize = 18)
ax.set_title(label = "Gender count in supermarket", fontsize = 20)
plt.show()
#Calculate Gender
sales.groupby(['Gender']).agg({'Total':'sum'})
#Count Customer Type
plt.figure(figsize= (14,6))
ax = sns.countplot(x = "Customer type", data = sales, palette = "rocket_r")
ax.set_title("Type of customers", fontsize = 25)
ax.set_xlabel("Customer type", fontsize = 16)
ax.set_ylabel("Customer Count", fontsize = 16)
#Check Customer Type in Different Branch
plt.figure(figsize=(14,6))
ax = sns.countplot(x = "Customer type", hue = "Branch", data = sales, palette= "rocket_r")
ax.set_title(label = "Customer type in different Branch", fontsize = 25)
ax.set_xlabel(xlabel = "Branches", fontsize = 16)
ax.set_ylabel(ylabel = "Customer Count", fontsize = 16)
#Checking Ratings of each branch
sns.boxplot(x="Branch", y = "Rating" ,data =sales).set_title("Ratings by Branch")
Branch B has the lowest rating among all the branches
#Checking Product Sales per Branch
branchSales = sns.lineplot(x="Branch", y='Quantity', data=sales).set_title("Product Sales per Branch")
#Checking Product Sales per City
citySales = sns.lineplot(x="City", y='Quantity', data=sales).set_title("Product Sales per City")
#Rating of 3 different branches
plt.figure(figsize=(14,6))
ax = sns.boxplot(x="Branch", y = "Rating" ,data =sales, palette= "RdYlBu")
ax.set_title("Rating distribution between Branches", fontsize = 25)
ax.set_xlabel(xlabel = "Branches", fontsize = 16)
ax.set_ylabel(ylabel = "Rating distribution", fontsize = 16)
#Checking which payment method is highly used
plt.figure(figsize = (14,6))
ax = sns.countplot(x = "Payment", data = sales, palette = "tab20")
ax.set_title(label = "Payment methods used by Customers ", fontsize= 25)
ax.set_xlabel(xlabel = "Payment method", fontsize = 16)
ax.set_ylabel(ylabel = " Customer Count", fontsize = 16)
#Check Payments of 3 branches
plt.figure(figsize = (14,6))
ax = sns.countplot(x="Payment", hue = "Branch", data = sales, palette= "tab20")
ax.set_title(label = "Payment distribution in all branches", fontsize= 25)
ax.set_xlabel(xlabel = "Payment method", fontsize = 16)
ax.set_ylabel(ylabel = "People Count", fontsize = 16)
#Checking the Total Amount of Sales vs Rating
plt.figure(figsize=(14,6))
rating_vs_sales = sns.lineplot(x="Total", y= "Rating", data=sales)
#Average Sales from Product Lines
plt.figure(figsize=(10,6))
ax = sns.boxenplot(x = "Quantity", y = "Product line", data = sales,)
ax.set_title(label = "Average Sales of Different Lines of Products", fontsize = 25)
ax.set_xlabel(xlabel = "Quantity Sold", fontsize = 16)
ax.set_ylabel(ylabel = "Product Line", fontsize = 16)
#Sales from each category of Product/Product Line
plt.figure(figsize=(14,6))
ax = sns.countplot(y='Product line', data=sales, order = sales['Product line'].value_counts().index)
ax.set_title(label = "Sales count of products", fontsize = 25)
ax.set_xlabel(xlabel = "Sales count", fontsize = 16)
ax.set_ylabel(ylabel= "Product Line", fontsize = 16)
#Total Sales from Product Line in boxplot
plt.figure(figsize=(14,6))
ax = sns.boxenplot(y= "Product line", x= "Total", data = sales)
ax.set_title(label = " Total sales of product", fontsize = 25)
ax.set_xlabel(xlabel = "Total sales", fontsize = 16)
ax.set_ylabel(ylabel = "Product Line", fontsize = 16)
#Checking average ratings of product line
plt.figure(figsize = (14,6))
ax = sns.boxenplot(y = "Product line", x = "Rating", data = sales)
ax.set_title("Average rating of product line", fontsize = 25)
ax.set_xlabel("Rating", fontsize = 16)
ax.set_ylabel("Product line", fontsize = 16)
#Product sales on the basis of Gender
plt.figure(figsize = (14,6))
ax= sns.stripplot(y= "Product line", x = "Total", hue = "Gender", data = sales)
ax.set_title(label = "Product sales on the basis of gender")
ax.set_xlabel(xlabel = " Total sales of products")
ax.set_ylabel(ylabel = "Product Line")
#Checking Gross Income from Product Line
plt.figure(figsize = (14,6))
ax = sns.relplot(y= "Product line", x = "gross income", data = sales)
# ax.set_title(label = "Products and Gross income")
# ax.set_xlabel(xlabel = "Total gross income")
# ax.set_ylabel(ylabel = "Product line")
#Checking Correlation
sales.corr()
#Checking Covariance
sales.cov()
#Heatmap for Correlation between Sales Metrics
plt.figure(figsize=(12,8))
sales_heatmap= sns.heatmap(sales.corr(),annot=True,
linewidths=.5,cmap='RdBu')
plt.title('Heatmap for Correlation between Sales Metrics')
#Analysis of City vs Total
df_test=sales[['City','Total']]
df_grp=df_test.groupby(['City'],as_index=False).sum()
df_grp
Analysis
Based on the summed totals of the 3 branches of the supermarket, Naypyitaw generates the most sales revenue, followed by Yangon and Mandalay with almost identical revenue.
#analysing best performing city branch based on gross income generated
df_pro=sales[['City','gross income']]
df_grp1=df_pro.groupby(['City'],as_index=False).sum()
df_grp1
Analysis
Based on the total income generated by the three supermarket branches, Naypyitaw has generated the maximum profit, followed by Mandalay and Yangon with similar incomes.
#analysing the sales revenue for each product line
df_sal=sales[['Product line','Total']]
df_sal_grp= df_sal.groupby('Product line', as_index=False).sum()
df_sal_grp
#visualising sales revenue for each product line
plt.figure(figsize=(12,6))
sns.barplot(x='Total',y='Product line',data=df_sal_grp,palette='autumn')
plt.title('Sales revenue based on Product Line',fontsize=20)
plt.xlabel('Total Sales',fontsize=14)
plt.ylabel('Product Line',fontsize=14)
Analysis
Based on the above representation, we can see that Food and Beverages has brought in the most sales revenue, while Health and Beauty products have brought in the least.
#Comparison of Product lines purchased based on Gender
plt.figure(figsize=(12,8))
sns.countplot(x='Product line', hue='Gender',data= sales,
palette='spring')
plt.title('Comparison of Product lines purchased based on Gender',fontsize=20)
plt.xlabel('Product Line',fontsize=15)
plt.ylabel('Count',fontsize=15)
Analysis
Based on the bar graph above, across the 3 branch locations of the supermarket there are 6 categories of products being sold. 'Health and Beauty', 'Electronic accessories', and 'Home and Lifestyle' products are more popular among men, whereas 'Sports and Travel', 'Food and Beverages', and 'Fashion Accessories' products are more popular among women.
'Health and Beauty' is the most preferred category for men, while 'Fashion Accessories' is the most in demand among women compared to other categories.
#analysing if the sales revenue generated is more through members or normal customers
df_cust=sales[['Customer type','Total']]
df_grp_cust=df_cust.groupby('Customer type',as_index=False).sum()
df_grp_cust
#analysing the average spend of members vs normal customers
df_grp_cust_avg=df_cust.groupby('Customer type',as_index=False).mean()
df_grp_cust_avg
Analysis
From the above analysis it can be concluded that the supermarket's member customers bring in more revenue than normal customers, although the difference between the two is not very significant.
Similarly, member customers have a higher average spend (327.79) than normal customers (318.12).
#analysing the most profitable product line in each of the city branches
df_prof_prod=sales[['City','Product line','gross income']]
df_grp_prof_prod= df_prof_prod.groupby(['City','Product line'], as_index=False).sum()
df_grp_prof_prod_piv= df_grp_prof_prod.pivot(index='City',columns='Product line')
df_grp_prof_prod_piv
#visualizing the most profitable product line in each of the city branches
plt.figure(figsize=(14,12))
sns.barplot(x='City',y='gross income',hue='Product line',
data= df_grp_prof_prod, palette='Set2')
plt.title('Most Profitable Product Lines in each City branch',fontsize=22)
plt.xlabel('City',fontsize=18)
plt.ylabel('Gross Income',fontsize=18)
plt.xticks(fontsize=12)
**Branch Analysis**
**Product Line Analysis**
#average rating of each city branch based on their product line
df_rating= sales[['City','Product line','Rating']]
df_rating_city=df_rating.groupby(['City','Product line'], as_index=False).mean()
df_rating_city_piv=df_rating_city.pivot(index='City',columns='Product line')
df_rating_city_piv
sns.swarmplot(x = sales['Branch'], y = sales['Rating']).set_title('Branch wise Ratings')
#visualising the best city branch based on their overall ratings
plt.figure(figsize=(12,8))
sns.boxplot(x='Rating',y='City',data= sales)
plt.title('Analysing the City branch ratings by customers', fontsize=20)
plt.xlabel('Rating',fontsize=14)
plt.ylabel('City',fontsize=14)
Analysis
The average rating of Naypyitaw is the highest amongst the three branches, with Mandalay having the lowest average rating. The minimum rating any branch receives is about 4.0, with the highest being 10.0.
Analysing the performance based on customer ratings of all branches, Yangon and Naypyitaw seem to be doing the best with very similar rating range.
#visualising the product line with the best rating per city branch
plt.figure(figsize=(20,12))
sns.barplot(x='City',y='Rating',hue='Product line',data= sales)
plt.title('Analysing the Product line ratings of each city branch', fontsize=20)
plt.xlabel('City',fontsize=15)
plt.ylabel('Rating',fontsize=15)
**Analysis**
#analysing quantity of products sold every month
df_mon=sales[['Product line','City','Quantity']]
df_mon_grp= df_mon.groupby(['Product line','City'],as_index=False).sum()
df_mon_piv= df_mon_grp.pivot(index='Product line',columns='City')
df_mon_piv
Analysis:
- Naypyitaw: high sales of Food & Beverages, Fashion Accessories, Electronic Accessories
- Yangon: Sports & Travel, Electronic Accessories, Food & Beverages
- Mandalay: Sports & Travel, Health & Beauty, Electronic Accessories
#visualizing the monthly product quantity sold
plt.figure(figsize=(12,8))
sns.lmplot(x='Product line',y='Quantity',hue='City',fit_reg=False,
data=df_mon_grp)
plt.title('Quantity of Product line sold in each city',fontsize=15)
plt.xlabel('Product Line',fontsize=12)
plt.ylabel('Product Quantity',fontsize=12)
plt.xticks(rotation=90)
Analysis
As per the above analysis:
Mandalay: this city has top sales from these product categories: Electronic Accessories, Health & Beauty, and Sports & Travel. Naypyitaw: this city has top sales from Food & Beverages, Fashion Accessories, and Electronic Accessories, in that order. Yangon: this city has high sales from Home and Lifestyle, Sports and Travel, and Electronic Accessories, in that order.
Let's look at the various products' performance.
sns.countplot(y = 'Product line', data=sales, order = sales['Product line'].value_counts().index )
sns.boxenplot(y = 'Product line', x = 'Quantity', data=sales )
From the above visual, Health and Beauty, Electronic Accessories, Home and Lifestyle, and Sports and Travel have a better average quantity sold than Food and Beverages and Fashion Accessories.
The image above shows the top product line types sold in the given dataset: Fashion Accessories is the highest, while Health and Beauty is the lowest.
sns.boxenplot(y = 'Product line', x = 'Total', data=sales )
sns.boxenplot(y = 'Product line', x = 'Rating', data=sales )
sns.stripplot(y = 'Product line', x = 'Total', hue = 'Gender', data=sales )
sns.relplot(y = 'Product line', x = 'gross income', data=sales )
Food and Beverages has the highest average rating, while Sports and Travel has the lowest.
Let's see when customers buy certain products in the various branches.
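One way to see the time-of-day pattern is to group totals by hour of the day. A minimal sketch with a hypothetical stand-in series (the notebook itself would group `sales['Total']` by `sales.index.hour`):

```python
import pandas as pd

# Hypothetical timestamped totals standing in for the sales frame
idx = pd.to_datetime(["2019-01-01 09:15", "2019-01-01 19:05", "2019-01-02 19:40"])
totals = pd.Series([100.0, 250.0, 300.0], index=idx)

# Sum revenue per hour of day to see when it peaks
by_hour = totals.groupby(totals.index.hour).sum()
print(by_hour.idxmax())  # → 19
```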
From the above plots, we can see that food and beverage sales are usually high in all three branches in the evening, especially around 19:00.
Let's see how customers make payments in this business.
sns.countplot(x="Payment", data =sales).set_title("Payment Channel")
Most customers pay through Ewallet and cash, while under 40 percent pay with their credit card. We would also like to see this payment-type distribution across all the branches.
sns.countplot(x="Payment", hue = "Branch", data =sales).set_title("Payment Channel by Branch")
From inspection, there are two types of customers: Members and Normal. Let's see how many there are and where they are.
sales['Customer type'].nunique()
sns.countplot(x="Customer type", data =sales).set_title("Customer Type")
sns.countplot(x="Customer type", hue = "Branch", data =sales).set_title("Customer Type by Branch")
sales.groupby(['Customer type']).agg({'Total': 'sum'})
sns.barplot(x="Customer type", y="Total", estimator = sum, data=sales)
Does the customer type influence customer rating? Let's find out.
sns.swarmplot(x="Customer type", y = "Rating", hue = "City", data =sales).set_title("Customer Type")
Using a Google search, I was able to get the latitude and longitude of each city. We can map them onto the dataset.
lat = {"Yangon": 16.8661, "Naypyitaw": 19.7633, "Mandalay": 21.9588 }
long = {"Yangon": 96.1951, "Naypyitaw": 96.0785, "Mandalay": 96.0891 }
sales['long'] = sales['City'].map(long)
sales['lat'] = sales['City'].map(lat)
sns.scatterplot(x="long", y = "lat", size = "Total", data=sales, legend="brief").set_title("Sales by City Location")
So, we understand that customer type has an influence on sales.
Answer: The busiest days are the following
business_dates: List[Timestamp] = pd.DataFrame(sales.Total.resample('1D').max()).sort_values('Total', ascending=False).index.tolist()
for business_date in business_dates:
    busy_day: str = str(business_date).split(' ')[0]
    # Todo: Get store branch and location
    sales[busy_day]['Total'].plot()
    plt.title(f"Sales on {busy_day}")
    plt.xlabel("Sales by hours")
    plt.ylabel("Total sales")
    plt.savefig(f"busy_store_{busy_day}.jpg")
    plt.show()
# Pick the New Year's Day entry from business_dates and plot branch-level stats
new_year_day: str = str( next(d for d in business_dates if str(d).startswith("2019-01-01")) ).split(' ')[0]
sales[new_year_day].groupby('Branch')['Total'].agg(['max', 'mean', 'min']).plot(kind='bar')
plt.title("Sales on new year day")
plt.grid(True)
round( sales.describe()['Total'].to_dict().get('mean'), 4 )
sales['Customer type'] = sales['Customer type'].map({'Member': 'Member', 'Normal': 'Non-Member'})
sales.groupby('Customer type').agg(['max'])['Total'].plot(kind='bar',colormap=ListedColormap(["#0580FB"]), rot=0)
plt.yticks(range(0,1100, 50))
plt.title("Total Sales between Members vs Non Members")
plt.ylabel("Total Sales")
plt.grid(True)
pd.DataFrame( sales.groupby('Branch')['Total'].resample('M').mean() )
pd.DataFrame( sales.groupby('Branch')['Total'].resample('M').mean() ).plot(kind='bar')
sales.groupby('Branch').agg(['min','count'])['Total']
sales.groupby('Branch').agg(['min','count'])['Total'].plot(kind='bar')
plt.grid(True)
plt.title("Branch with the most customer but lower profit ")
map_months: List[str] = ['January', 'February', 'March']
for index, month in enumerate(range(1, 4)):
    sales[f'2019-0{month}'].groupby('Product line')['Total'].agg(['min', 'mean', 'max']).plot(kind='bar')
    plt.title(f"Product Line Sales in Month of {map_months[index]}")
    plt.ylabel("Total Sales")
# Clusters sales based on the following columns
col_names: List[str] = sales.columns.tolist()
sales[[col_names[7], col_names[9]]] = sales[[col_names[7], col_names[9]]].astype(float)
clusters: List[str] = [col_names[7], col_names[9]] + col_names[13:]
pd.plotting.scatter_matrix( sales[clusters] , figsize=(150,150))
#Printing the Cluster Score
km_sales = KMeans(n_clusters=np.random.randint(5,10), random_state=np.random.randint(100,150))
km_sales.fit(sales[clusters])
# Number of cluster equals random number from 5 - 10
sales['sales_cluster'] = km_sales.labels_
# Silhouette scores
silhouette_score(sales[clusters], km_sales.labels_ )
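For intuition, the silhouette score ranges from -1 to +1, and well-separated clusters score close to +1. A minimal sketch on hypothetical toy data (not the sales frame):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two well-separated 2-D blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km.labels_)
print(score > 0.8)  # well-separated clusters score close to +1
```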
cluster_colors: List[str] = np.array( ['red', 'blue', 'gold', 'darkviolet', 'lime', 'yellow', 'green', 'blueviolet', 'lime', 'tomato', 'orangered','aqua','cyan', 'turquoise','steelblue'] )
cluster_colors
sales.groupby('sales_cluster').mean()
# Factorize categorical attributes
sales['Map_Payment'] = pd.factorize(sales['Payment'])[0] # PAYMENT
sales['Map_Gender'] = pd.factorize(sales['Gender'])[0]
sales['Map_City'] = pd.factorize(sales['City'])[0]
sales['Map_Branch'] = pd.factorize(sales['Branch'])[0]
sales['Map_Product_Line'] = pd.factorize(sales['Product line'])[0]
sales['Map_Customer_Type'] = pd.factorize(sales['Customer type'])[0]
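`pd.factorize` assigns an integer code to each unique value, in order of first appearance, which is what the mapping columns above rely on. A small illustration with hypothetical payment labels:

```python
import pandas as pd

# Codes follow order of first appearance; uniques lists the distinct values
codes, uniques = pd.factorize(pd.Series(["Ewallet", "Cash", "Ewallet", "Credit card"]))
print(codes.tolist())  # → [0, 1, 0, 2]
print(list(uniques))   # → ['Ewallet', 'Cash', 'Credit card']
```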
col_names: List[str] = sales.columns.tolist() # store the colum names
cluster_centers = sales.groupby('sales_cluster').mean()
plt.scatter(sales['Map_City'], sales['gross income'], c=cluster_colors[sales['sales_cluster']], s=100 )
plt.scatter(cluster_centers['Map_City'], cluster_centers['gross income'], linewidths=1, marker="X", c='black',s=300 )
plt.xlabel("City")
plt.ylabel("Gross Income")
So, we can see that City 1 (Yangon) contains the most cluster points.
table_colors = pd.DataFrame.from_dict( cluster_colors )
table_colors.columns = ['Table Colors']
table_colors
cluster_centers = sales.groupby('sales_cluster').mean()
plt.scatter(sales['Rating'], sales['Total'], c=cluster_colors[sales['sales_cluster']], s=25 )
plt.scatter(cluster_centers['Rating'], cluster_centers['Total'], linewidths=1, marker="X", c='black',s=100 )
plt.xlabel("Rating")
plt.ylabel("Total")
sales['Total'].max()
plt.scatter(sales['gross income'], sales['Rating'], c=cluster_colors[sales['sales_cluster']], s=25 )
plt.scatter(cluster_centers['gross income'], cluster_centers['Rating'], linewidths=1, marker="X", c='black',s=100 )
plt.xlabel("Gross Income")
plt.ylabel("Rating")
pd.plotting.scatter_matrix(sales[clusters], c=cluster_colors[sales['sales_cluster']], figsize=(50,50), s=75)
cluster_ranges = range(5,100)
inertias: List[float] = []
silhouette_scores: List[float] = []
for n_cluster in cluster_ranges:
    kmeans = KMeans(n_clusters=n_cluster)
    kmeans.fit(sales[clusters])
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(sales[clusters], kmeans.labels_))
# DataFrame of silhouette scores and inertias
cluster_scores = pd.DataFrame(np.array([inertias, silhouette_scores]).transpose(), columns=['Inertias', 'Silhouette Scores'], index=range(5,100))
cluster_scores[ cluster_scores['Silhouette Scores'] == cluster_scores['Silhouette Scores'].max() ] # The best n_cluster = 7
cluster_scores['Inertias'].plot()
plt.title("Clusters Inertias")
plt.grid(True)
cluster_scores['Silhouette Scores'].plot()
plt.xticks(range(5,101,5))
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficients")
scaled_models: List[float] = []
X_scaled = StandardScaler().fit_transform(sales[clusters])  # scale once, reuse in the loop
for n_cluster in range(5, 101):
    kmeans = KMeans(n_clusters=n_cluster, random_state=np.random.randint(100,150))
    kmeans.fit(X_scaled)
    scaled_models.append(silhouette_score(X_scaled, kmeans.labels_))
cluster_scores['scaled_silhouette'] = scaled_models[0:95]
plt.plot(scaled_models)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficients")
plt.title("Scaled Models")
plt.xticks(range(5,101,5))
cluster_scores[ cluster_scores['scaled_silhouette'] == cluster_scores['scaled_silhouette'].max() ]
kmeans = KMeans(n_clusters=99, random_state=123)
kmeans.fit(StandardScaler().fit_transform(sales[clusters]))
A random forest model is created and fit to the training data. The trained model is then used to predict on the validation set in order to validate the model. The mean absolute error is calculated to estimate the difference between the predicted values and the actual values.
pd.DataFrame(sales.Payment.value_counts())
sales.rename(columns={'Product line':'Product_line'} , inplace=True)
plt.figure(figsize = (10,7))
sales.Payment.value_counts().plot(kind = "pie")
plt.title("Payment Method")
plt.figure(figsize = (10,7))
sales.Product_line.value_counts().plot(kind = "bar")
plt.title("Product line")
sns.lmplot(x = 'Rating' , y = 'Total' , data = sales, hue = 'Gender' , height = 7 , aspect = 1.5)
#Making Rating as Measure to Find Sales Features (Product-Line, Unit Price, Quantity, Tax, Customer type, City)
y = sales['Rating']
features = ['Product_line', 'Unit price','Quantity','Tax 5%','Customer type','City'] #sale Features
X = sales[features]
train_X,val_X,train_y,val_y = train_test_split(X,y,random_state = 1)
train_X= pd.get_dummies(train_X)
train_y= pd.get_dummies(train_y)
val_X= pd.get_dummies(val_X)
val_y= pd.get_dummies(val_y)
train_X.head()
leaf_model = RandomForestRegressor(random_state = 1)
leaf_model.fit(train_X,train_y)
predict_value = leaf_model.predict(val_X)
predict_value
mean_error = mean_absolute_error(val_y,predict_value)
print('The Mean Absolute Error is : {}'.format(mean_error))
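The mean absolute error reported above is simply the mean of the absolute differences between predictions and actual values. A minimal self-contained sketch (with hypothetical ratings, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([7.0, 8.5, 6.0])  # hypothetical actual ratings
y_pred = np.array([6.8, 8.0, 6.3])  # hypothetical model predictions

# MAE = mean(|y_true - y_pred|)
manual = np.mean(np.abs(y_true - y_pred))
print(round(manual, 2))  # → 0.33
```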
Now we are going to check for outliers, i.e., values that are out of bounds, by using anomaly detection. In this section, we use box plots and scatter diagrams to check for outliers in some columns.
#Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
from sklearn.ensemble import IsolationForest
#Plotting a scatter diagram to check Gross Margin Percentage
plt.scatter(range(sales.shape[0]), np.sort(sales['gross margin percentage'].values))
plt.xlabel('index')
plt.ylabel('Gross margin percentage')
plt.title("Gross margin percentage distribution")
sns.despine()
print("Skewness: %f" % sales['gross margin percentage'].skew())
print("Kurtosis: %f" % sales['gross margin percentage'].kurt())
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(sales['gross margin percentage'].values.reshape(-1, 1))
xx = np.linspace(sales['gross margin percentage'].min(), sales['gross margin percentage'].max(), len(sales)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='r',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Gross Margin Percentage')
plt.show();
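Beyond plotting the score curve, `IsolationForest.predict` can also flag individual observations directly (+1 for inliers, -1 for outliers). A minimal sketch on hypothetical one-dimensional data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical 1-D feature: 50 normal values plus one obvious outlier
rng = np.random.RandomState(0)
values = np.r_[rng.normal(4.76, 0.05, 50), [12.0]].reshape(-1, 1)

iso = IsolationForest(n_estimators=100, random_state=0).fit(values)
labels = iso.predict(values)  # +1 = inlier, -1 = outlier
print(int(labels[-1]))  # the injected outlier is flagged as -1
```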
sales
print(sales.shape)
#Checking Total Sales vs Rating in a box plot
ax = sns.boxenplot(y = "Total", x = "Rating", data = sales)
ax.set_title("Total sales vs rating", fontsize = 25)
ax.set_xlabel("Rating", fontsize = 16)
ax.set_ylabel("Total", fontsize = 16)
#Total Sales of Product
plt.figure(figsize=(14,6))
ax = sns.boxenplot(y= "Product_line", x= "Total", data = sales)
ax.set_title(label = " Total sales of product", fontsize = 25)
ax.set_xlabel(xlabel = "Total sales", fontsize = 16)
ax.set_ylabel(ylabel = "Product Line", fontsize = 16)
#Making Scatter Diagram
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
#Setting Total column and Product Line to check for anomalies
x = sales['Total']
y = sales['Product_line']
plt.plot(x, y, 'o', color='Blue');
a. Busiest days: holidays and New Year.
b. Average money spent in a single transaction: people spend on average about $322.97.
c. Sales, Members vs Non-Members: Members: 1040, Non-Members: 1020.
d. Branch average profit each month.
e. Branch with the most customers but lower profit:
Branch: (min, count) => A: (12.6945, 340), B: (18.6375, 332), C: (10.6785, 328)
f. Product line sales in January, February, and March:
January - Food & Beverages, February - Fashion Accessories, March - Home & Lifestyle
Descriptions: We built clusters based on the selected columns and explored the relationships: Gross Income vs Store Location, Rating vs Total, and Rating vs Gross Income.
We used a scatter matrix of the clusters to check the sale features, and found that the best cluster count is n=7 using the elbow method.
In the random forest regression, we found that the most sold goods are Fashion Accessories and the most used payment methods are Cash and E-Wallet. We also examined all the sales features (Product Line, Unit Price, Quantity, Tax, Customer Type, City) in terms of customer rating.
The mean absolute error is one of a number of ways of comparing forecasts with their eventual outcomes. The mean absolute error here is 0.032192786885245864, which is a very low error.
For anomaly detection, we used histogram, scatter, and box plots to check for anomalous values. We also verified the Gross Margin Percentage using the Isolation Forest technique.
Using all of these results, we can do many things to improve the supermarket's sales.